Our inspiration comes from the Kaggle competition Instacart Market Basket Analysis, which is also the source of our data sets. Instacart is a grocery ordering and delivery application. The company provides an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 of their orders are provided, with the sequence of products purchased in each order, the day of the week and hour of the day the order was placed, and a relative measure of time between orders (details of each data set are introduced below).
Instacart hopes that competition participants will test models for predicting which products a user will buy again, try for the first time, or add to their cart next during a session, which may involve models such as XGBoost, word2vec and Annoy (Jeremy Stanley, May 3, 2017).
Predicting repurchases and order placement days are popular and helpful tasks among e-commerce companies. For example, such predictions can be applied to personalization, managing supply and demand, churn prediction, improved customer service, etc. (Bigcommerce Blog, Nick Shaw). Amazon has already developed a patent called “anticipatory shipping” that can predict what and when people want to buy, and ship packages even before customers have placed an order (The Economic Times, Jan 27, 2014). In this way, it can largely optimize logistics management, human and equipment resources, and inventory arrangement, which helps to decrease costs and increase profit. Meanwhile, this type of prediction also requires much more information about customers’ behavior, such as the items customers have searched for, the amount of time a user’s cursor hovers over a product, the number of clicks by users, and the purchase conversion rates of users’ clicks, add-to-cart actions, collections, and so on.
Given these limitations on the available information, and since we would like to apply the models we have learned in the course, we prefer to predict the day of the week on which an order will be placed. This prediction can then serve as an additional input to demand forecasting, which is useful for steering decision-making processes, such as inventory arrangement, on an e-commerce platform.
Overall, we produce a new dataset based on the files we downloaded from the competition website, and assume that:
Thus, our research questions will be:
On which day of the week will a given order be placed?
For this question, we will use supervised methods.
Are there any common components between departments or aisles?
For this question, we will use unsupervised methods.
The aisle table shows unique aisle ids under the aisle_id column and aisle names under the aisle column; for example, aisle_id 1 represents the prepared soups salads aisle. There are 134 ids in total and no n/a values are found in the table.
| aisle_id | aisle |
|---|---|
| 1 | prepared soups salads |
| 2 | specialty cheeses |
| 3 | energy granola bars |
| 4 | instant foods |
| 5 | marinades meat preparation |
| 6 | other |
| 7 | packaged meat |
| 8 | bakery desserts |
| 9 | pasta sauce |
| 10 | kitchen supplies |
| 11 | cold flu allergy |
| 12 | fresh pasta |
| 13 | prepared meals |
| 14 | tofu meat alternatives |
| 15 | packaged seafood |
| 16 | fresh herbs |
| 17 | baking ingredients |
| 18 | bulk dried fruits vegetables |
| 19 | oils vinegars |
| 20 | oral hygiene |
| 21 | packaged cheese |
| 22 | hair care |
| 23 | popcorn jerky |
| 24 | fresh fruits |
| 25 | soap |
| 26 | coffee |
| 27 | beers coolers |
| 28 | red wines |
| 29 | honeys syrups nectars |
| 30 | latino foods |
| 31 | refrigerated |
| 32 | packaged produce |
| 33 | kosher foods |
| 34 | frozen meat seafood |
| 35 | poultry counter |
| 36 | butter |
| 37 | ice cream ice |
| 38 | frozen meals |
| 39 | seafood counter |
| 40 | dog food care |
| 41 | cat food care |
| 42 | frozen vegan vegetarian |
| 43 | buns rolls |
| 44 | eye ear care |
| 45 | candy chocolate |
| 46 | mint gum |
| 47 | vitamins supplements |
| 48 | breakfast bars pastries |
| 49 | packaged poultry |
| 50 | fruit vegetable snacks |
| 51 | preserved dips spreads |
| 52 | frozen breakfast |
| 53 | cream |
| 54 | paper goods |
| 55 | shave needs |
| 56 | diapers wipes |
| 57 | granola |
| 58 | frozen breads doughs |
| 59 | canned meals beans |
| 60 | trash bags liners |
| 61 | cookies cakes |
| 62 | white wines |
| 63 | grains rice dried goods |
| 64 | energy sports drinks |
| 65 | protein meal replacements |
| 66 | asian foods |
| 67 | fresh dips tapenades |
| 68 | bulk grains rice dried goods |
| 69 | soup broth bouillon |
| 70 | digestion |
| 71 | refrigerated pudding desserts |
| 72 | condiments |
| 73 | facial care |
| 74 | dish detergents |
| 75 | laundry |
| 76 | indian foods |
| 77 | soft drinks |
| 78 | crackers |
| 79 | frozen pizza |
| 80 | deodorants |
| 81 | canned jarred vegetables |
| 82 | baby accessories |
| 83 | fresh vegetables |
| 84 | milk |
| 85 | food storage |
| 86 | eggs |
| 87 | more household |
| 88 | spreads |
| 89 | salad dressing toppings |
| 90 | cocoa drink mixes |
| 91 | soy lactosefree |
| 92 | baby food formula |
| 93 | breakfast bakery |
| 94 | tea |
| 95 | canned meat seafood |
| 96 | lunch meat |
| 97 | baking supplies decor |
| 98 | juice nectars |
| 99 | canned fruit applesauce |
| 100 | missing |
| 101 | air fresheners candles |
| 102 | baby bath body care |
| 103 | ice cream toppings |
| 104 | spices seasonings |
| 105 | doughs gelatins bake mixes |
| 106 | hot dogs bacon sausage |
| 107 | chips pretzels |
| 108 | other creams cheeses |
| 109 | skin care |
| 110 | pickled goods olives |
| 111 | plates bowls cups flatware |
| 112 | bread |
| 113 | frozen juice |
| 114 | cleaning products |
| 115 | water seltzer sparkling water |
| 116 | frozen produce |
| 117 | nuts seeds dried fruit |
| 118 | first aid |
| 119 | frozen dessert |
| 120 | yogurt |
| 121 | cereal |
| 122 | meat counter |
| 123 | packaged vegetables fruits |
| 124 | spirits |
| 125 | trail mix snack mix |
| 126 | feminine care |
| 127 | body lotions soap |
| 128 | tortillas flat bread |
| 129 | frozen appetizers sides |
| 130 | hot cereal pancake mixes |
| 131 | dry pasta |
| 132 | beauty |
| 133 | muscles joints pain relief |
| 134 | specialty wines champagnes |
The department table shows unique department ids under the department_id column and department names under the department column; for example, department_id 1 represents the frozen department. There are 21 ids in total and no n/a values are found in the table.
| department_id | department |
|---|---|
| 1 | frozen |
| 2 | other |
| 3 | bakery |
| 4 | produce |
| 5 | alcohol |
| 6 | international |
| 7 | beverages |
| 8 | pets |
| 9 | dry goods pasta |
| 10 | bulk |
| 11 | personal care |
| 12 | meat seafood |
| 13 | pantry |
| 14 | breakfast |
| 15 | canned goods |
| 16 | dairy eggs |
| 17 | household |
| 18 | babies |
| 19 | snacks |
| 20 | deli |
| 21 | missing |
The product table shows unique product ids under the product_id column and product names under the product_name column; for example, product_id 1 represents Chocolate Sandwich Cookies. This table also shows the aisle_id and department_id associated with each product. There are approximately 50k ids in total and no n/a values are found in the table.
| product_id | product_name | aisle_id | department_id |
|---|---|---|---|
| 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 2 | All-Seasons Salt | 104 | 13 |
| 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 4 | Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38 | 1 |
| 5 | Green Chile Anytime Sauce | 5 | 13 |
| 6 | Dry Nose Oil | 11 | 11 |
| 7 | Pure Coconut Water With Orange | 98 | 7 |
| 8 | Cut Russet Potatoes Steam N’ Mash | 116 | 1 |
| 9 | Light Strawberry Blueberry Yogurt | 120 | 16 |
| 10 | Sparkling Orange Juice & Prickly Pear Beverage | 115 | 7 |
| 11 | Peach Mango Juice | 31 | 7 |
| 12 | Chocolate Fudge Layer Cake | 119 | 1 |
| 13 | Saline Nasal Mist | 11 | 11 |
| 14 | Fresh Scent Dishwasher Cleaner | 74 | 17 |
| 15 | Overnight Diapers Size 6 | 56 | 18 |
| 16 | Mint Chocolate Flavored Syrup | 103 | 19 |
| 17 | Rendered Duck Fat | 35 | 12 |
| 18 | Pizza for One Suprema Frozen Pizza | 79 | 1 |
| 19 | Gluten Free Quinoa Three Cheese & Mushroom Blend | 63 | 9 |
| 20 | Pomegranate Cranberry & Aloe Vera Enrich Drink | 98 | 7 |
| 21 | Small & Medium Dental Dog Treats | 40 | 8 |
| 22 | Fresh Breath Oral Rinse Mild Mint | 20 | 11 |
| 23 | Organic Turkey Burgers | 49 | 12 |
| 24 | Tri-Vi-Sol® Vitamins A-C-and D Supplement Drops for Infants | 47 | 11 |
| 25 | Salted Caramel Lean Protein & Fiber Bar | 3 | 19 |
| 26 | Fancy Feast Trout Feast Flaked Wet Cat Food | 41 | 8 |
| 27 | Complete Spring Water Foaming Antibacterial Hand Wash | 127 | 11 |
| 28 | Wheat Chex Cereal | 121 | 14 |
| 29 | Fresh Cut Golden Sweet No Salt Added Whole Kernel Corn | 81 | 15 |
| 30 | Three Cheese Ziti, Marinara with Meatballs | 38 | 1 |
| 31 | White Pearl Onions | 123 | 4 |
| 32 | Nacho Cheese White Bean Chips | 107 | 19 |
| 33 | Organic Spaghetti Style Pasta | 131 | 9 |
| 34 | Peanut Butter Cereal | 121 | 14 |
| 35 | Italian Herb Porcini Mushrooms Chicken Sausage | 106 | 12 |
| 36 | Traditional Lasagna with Meat Sauce Savory Italian Recipes | 38 | 1 |
| 37 | Noodle Soup Mix With Chicken Broth | 69 | 15 |
| 38 | Ultra Antibacterial Dish Liquid | 100 | 21 |
| 39 | Daily Tangerine Citrus Flavored Beverage | 64 | 7 |
| 40 | Beef Hot Links Beef Smoked Sausage With Chile Peppers | 106 | 12 |
| 41 | Organic Sourdough Einkorn Crackers Rosemary | 78 | 19 |
| 42 | Biotin 1000 mcg | 47 | 11 |
| 43 | Organic Clementines | 123 | 4 |
| 44 | Sparkling Raspberry Seltzer | 115 | 7 |
| 45 | European Cucumber | 83 | 4 |
| 46 | Raisin Cinnamon Bagels 5 count | 58 | 1 |
| 47 | Onion Flavor Organic Roasted Seaweed Snack | 66 | 6 |
| 48 | School Glue, Washable, No Run | 87 | 17 |
| 49 | Vegetarian Grain Meat Sausages Italian - 4 CT | 14 | 20 |
| 50 | Pumpkin Muffin Mix | 105 | 13 |
This table shows the details of all orders in the training data set provided by Instacart. It shows the product_ids purchased in each order. For example, order_id 1 consists of 8 products, including the following product ids: 13176, 47209, 22035, etc. There are ~131k orders in total and no n/a values are found in the table.
| order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|
| 1 | 49302 | 1 | 1 |
| 1 | 11109 | 2 | 1 |
| 1 | 10246 | 3 | 0 |
| 1 | 49683 | 4 | 0 |
| 1 | 43633 | 5 | 1 |
| 1 | 13176 | 6 | 0 |
| 1 | 47209 | 7 | 0 |
| 1 | 22035 | 8 | 1 |
| 36 | 39612 | 1 | 0 |
| 36 | 19660 | 2 | 1 |
| 36 | 49235 | 3 | 0 |
| 36 | 43086 | 4 | 1 |
| 36 | 46620 | 5 | 1 |
| 36 | 34497 | 6 | 1 |
| 36 | 48679 | 7 | 1 |
| 36 | 46979 | 8 | 1 |
| 38 | 11913 | 1 | 0 |
| 38 | 18159 | 2 | 0 |
| 38 | 4461 | 3 | 0 |
| 38 | 21616 | 4 | 1 |
| 38 | 23622 | 5 | 0 |
| 38 | 32433 | 6 | 0 |
| 38 | 28842 | 7 | 0 |
| 38 | 42625 | 8 | 0 |
| 38 | 39693 | 9 | 0 |
| 96 | 20574 | 1 | 1 |
| 96 | 30391 | 2 | 0 |
| 96 | 40706 | 3 | 1 |
| 96 | 25610 | 4 | 0 |
| 96 | 27966 | 5 | 1 |
| 96 | 24489 | 6 | 1 |
| 96 | 39275 | 7 | 1 |
| 98 | 8859 | 1 | 1 |
| 98 | 19731 | 2 | 1 |
| 98 | 43654 | 3 | 1 |
| 98 | 13176 | 4 | 1 |
| 98 | 4357 | 5 | 1 |
| 98 | 37664 | 6 | 1 |
| 98 | 34065 | 7 | 1 |
| 98 | 35951 | 8 | 1 |
| 98 | 43560 | 9 | 1 |
| 98 | 9896 | 10 | 1 |
| 98 | 27509 | 11 | 1 |
| 98 | 15455 | 12 | 1 |
| 98 | 27966 | 13 | 1 |
| 98 | 47601 | 14 | 1 |
| 98 | 40396 | 15 | 1 |
| 98 | 35042 | 16 | 1 |
| 98 | 40986 | 17 | 1 |
| 98 | 1939 | 18 | 1 |
This table shows the purchase time (day of week under the order_dow column and hour of day under the order_hour_of_day column) for each order. For example, order_id 1187899 has order_dow 4 and order_hour_of_day 8, meaning the order was placed on Thursday (order_dow = 4) at 8am (order_hour_of_day = 8).
| order_id | order_dow | order_hour_of_day |
|---|---|---|
| 1187899 | 4 | 8 |
| 1492625 | 1 | 11 |
| 2196797 | 0 | 11 |
| 525192 | 2 | 11 |
| 880375 | 1 | 14 |
| 1094988 | 6 | 10 |
| 1822501 | 0 | 19 |
| 1827621 | 0 | 21 |
| 2316178 | 2 | 19 |
| 2180313 | 3 | 10 |
| 2461523 | 6 | 9 |
| 1854765 | 1 | 12 |
| 3402036 | 1 | 12 |
| 965160 | 0 | 16 |
| 2614670 | 5 | 14 |
| 3110252 | 4 | 11 |
| 62370 | 2 | 13 |
| 698604 | 4 | 13 |
| 1524161 | 0 | 13 |
| 3173750 | 0 | 9 |
| 2032076 | 0 | 20 |
| 2803975 | 0 | 11 |
| 1864787 | 5 | 11 |
| 2436259 | 0 | 12 |
| 1947848 | 4 | 20 |
| 2906490 | 4 | 22 |
| 2924697 | 5 | 18 |
| 519514 | 4 | 12 |
| 1750084 | 3 | 9 |
| 1647290 | 4 | 16 |
| 3088145 | 2 | 10 |
| 39325 | 2 | 18 |
| 13318 | 1 | 9 |
| 1651215 | 0 | 12 |
| 1019719 | 2 | 12 |
| 2989905 | 6 | 8 |
| 2639013 | 0 | 13 |
| 1072954 | 6 | 17 |
| 34647 | 3 | 19 |
| 2757217 | 0 | 11 |
| 669729 | 5 | 12 |
| 3038639 | 5 | 13 |
| 2608424 | 2 | 14 |
| 482516 | 4 | 7 |
| 3294399 | 4 | 8 |
| 1700658 | 6 | 11 |
| 21708 | 0 | 6 |
| 2178718 | 2 | 8 |
| 1734166 | 5 | 18 |
| 859654 | 1 | 10 |
This table joins Tables 1 - 5 together. Thus, it includes all the information we need for the analysis: order_id, purchase time, aisle and department. Note that we do not include product_id and product_name in this table because their dimensionality is too large (over 50k categories). Thus, we will mainly use department_id (21 categories) in our analysis, and aisle_id (134 categories) in the PCA part of the unsupervised learning section.
| order_id | order_dow | order_hour_of_day | aisle_id | aisle | department_id | department |
|---|---|---|---|---|---|---|
| 1187899 | 4 | 8 | 77 | soft drinks | 7 | beverages |
| 1187899 | 4 | 8 | 21 | packaged cheese | 16 | dairy eggs |
| 1187899 | 4 | 8 | 120 | yogurt | 16 | dairy eggs |
| 1187899 | 4 | 8 | 54 | paper goods | 17 | household |
| 1187899 | 4 | 8 | 45 | candy chocolate | 19 | snacks |
| 1187899 | 4 | 8 | 117 | nuts seeds dried fruit | 19 | snacks |
| 1187899 | 4 | 8 | 121 | cereal | 14 | breakfast |
| 1187899 | 4 | 8 | 23 | popcorn jerky | 19 | snacks |
| 1187899 | 4 | 8 | 84 | milk | 16 | dairy eggs |
| 1187899 | 4 | 8 | 53 | cream | 16 | dairy eggs |
| 1187899 | 4 | 8 | 77 | soft drinks | 7 | beverages |
| 1492625 | 1 | 11 | 96 | lunch meat | 20 | deli |
| 1492625 | 1 | 11 | 58 | frozen breads doughs | 1 | frozen |
| 1492625 | 1 | 11 | 107 | chips pretzels | 19 | snacks |
| 1492625 | 1 | 11 | 23 | popcorn jerky | 19 | snacks |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 24 | fresh fruits | 4 | produce |
| 1492625 | 1 | 11 | 91 | soy lactosefree | 16 | dairy eggs |
| 1492625 | 1 | 11 | 46 | mint gum | 19 | snacks |
| 1492625 | 1 | 11 | 96 | lunch meat | 20 | deli |
| 1492625 | 1 | 11 | 80 | deodorants | 11 | personal care |
| 1492625 | 1 | 11 | 1 | prepared soups salads | 20 | deli |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 38 | frozen meals | 1 | frozen |
| 1492625 | 1 | 11 | 69 | soup broth bouillon | 15 | canned goods |
| 1492625 | 1 | 11 | 37 | ice cream ice | 1 | frozen |
| 1492625 | 1 | 11 | 37 | ice cream ice | 1 | frozen |
| 1492625 | 1 | 11 | 37 | ice cream ice | 1 | frozen |
| 1492625 | 1 | 11 | 117 | nuts seeds dried fruit | 19 | snacks |
| 1492625 | 1 | 11 | 3 | energy granola bars | 19 | snacks |
| 1492625 | 1 | 11 | 69 | soup broth bouillon | 15 | canned goods |
| 1492625 | 1 | 11 | 69 | soup broth bouillon | 15 | canned goods |
| 2196797 | 0 | 11 | 29 | honeys syrups nectars | 13 | pantry |
| 2196797 | 0 | 11 | 24 | fresh fruits | 4 | produce |
| 2196797 | 0 | 11 | 21 | packaged cheese | 16 | dairy eggs |
| 2196797 | 0 | 11 | 66 | asian foods | 6 | international |
| 2196797 | 0 | 11 | 101 | air fresheners candles | 17 | household |
| 2196797 | 0 | 11 | 83 | fresh vegetables | 4 | produce |
| 2196797 | 0 | 11 | 66 | asian foods | 6 | international |
| 2196797 | 0 | 11 | 123 | packaged vegetables fruits | 4 | produce |
The plot shows the distribution of orders by time (day of week and hour of day).
On the left chart (order_dow), we can observe that the most frequent ordering days are Sundays and Mondays compared to the rest of the week; on the right chart (order_hour_of_day), we note a high demand for orders between 9am and 6pm.
This table shows the top 10 aisles by number of purchases. We can see that the most purchased aisles are fresh vegetables and fresh fruits (~150k orders each).
| aisle | department | total_order |
|---|---|---|
| fresh vegetables | produce | 150609 |
| fresh fruits | produce | 150473 |
| packaged vegetables fruits | produce | 78493 |
| yogurt | dairy eggs | 55240 |
| packaged cheese | dairy eggs | 41699 |
| water seltzer sparkling water | beverages | 36617 |
| milk | dairy eggs | 32644 |
| chips pretzels | snacks | 31269 |
| soy lactosefree | dairy eggs | 26240 |
| bread | bakery | 23635 |
This table shows the top 10 departments by number of purchases. We can see that the most purchased department is produce (~409k orders).
| department | total_order |
|---|---|
| produce | 409087 |
| dairy eggs | 217051 |
| snacks | 118862 |
| beverages | 114046 |
| frozen | 100426 |
| pantry | 81242 |
| bakery | 48394 |
| canned goods | 46799 |
| deli | 44291 |
| dry goods pasta | 38713 |
Here, we would like to observe the sales pattern in depth by splitting it by department. First, we look at the pattern of weekly sales.
From these graphs, we can observe the following patterns:
We will use PCA to analyze whether we can reduce the dimensionality of the data set (the number of orders by department).
PCA summarizes the shared variation among variables. There are two variants: correlation-based (scaled) and covariance-based (non-scaled). In our analysis, we focus on the relationship between the number of orders from each department and the day of the week on which users purchase, so we base our PCA on the covariance matrix, i.e. the non-scaled version. However, it is also interesting to see how the results differ between the scaled and non-scaled variants, so we will perform the correlation-based PCA as well.
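The only difference between the two variants is whether each department's order counts are standardized before the eigendecomposition. A minimal sketch of this distinction, using NumPy on a small synthetic matrix (the real order-by-department table is not reproduced here, so the data below is illustrative only):

```python
import numpy as np

def pca_variance(X, scaled=False):
    """Fraction of total variance explained by each principal component.

    scaled=True  -> PCA on the correlation matrix (standardized columns)
    scaled=False -> PCA on the covariance matrix (raw columns)
    """
    Xc = X - X.mean(axis=0)
    if scaled:
        Xc = Xc / Xc.std(axis=0, ddof=1)  # standardizing gives correlation PCA
    cov = np.cov(Xc, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, descending
    return eigvals / eigvals.sum()

# Synthetic example: three "departments", one with much larger variance.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 10, 200),   # high-variance column
                     rng.normal(0, 1, 200),
                     rng.normal(0, 1, 200)])

# Covariance PCA is dominated by the high-variance column,
# while correlation PCA weights all columns equally.
print(pca_variance(X, scaled=False)[0])
print(pca_variance(X, scaled=True)[0])
```

This is why the non-scaled PCA below concentrates variance in few components while the scaled one spreads it almost uniformly.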
We observe that the first and second components explain 46.7% and 13.8% of the variance in the data, respectively. Following the rule of thumb of selecting enough components to explain at least 75% of the variation, comp 1 - comp 5 are selected, together explaining around 79.8% of the variance.
Our findings are as follows:
| eigenvalue | percentage of variance | cumulative percentage of variance | |
|---|---|---|---|
| comp 1 | 12.642 | 46.69 | 46.7 |
| comp 2 | 3.727 | 13.76 | 60.5 |
| comp 3 | 2.130 | 7.87 | 68.3 |
| comp 4 | 1.577 | 5.82 | 74.1 |
| comp 5 | 1.535 | 5.67 | 79.8 |
| comp 6 | 1.115 | 4.12 | 83.9 |
| comp 7 | 0.647 | 2.39 | 86.3 |
| comp 8 | 0.610 | 2.25 | 88.6 |
| comp 9 | 0.515 | 1.90 | 90.5 |
| comp 10 | 0.469 | 1.73 | 92.2 |
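The component-selection rule of thumb described above can be sketched as follows, using the percentages of variance from the table:

```python
def n_components_for(variance_pcts, threshold=75.0):
    """Smallest number of leading components whose cumulative
    percentage of explained variance reaches the threshold."""
    total = 0.0
    for i, pct in enumerate(variance_pcts, start=1):
        total += pct
        if total >= threshold:
            return i
    return len(variance_pcts)

# Percentages of variance for comp 1 - comp 10 of the non-scaled PCA.
pcts = [46.69, 13.76, 7.87, 5.82, 5.67, 4.12, 2.39, 2.25, 1.90, 1.73]
print(n_components_for(pcts))  # 5 components reach ~79.8% cumulative variance
```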
We find that the first and second components explain only 13.6% and 6.6% of the variance, respectively, and that 15 components (out of 21) are needed to explain 75% of the variation. This means that the correlations between departments are very low, so PCA cannot meaningfully reduce the dimensionality of the scaled data.
| eigenvalue | percentage of variance | cumulative percentage of variance | |
|---|---|---|---|
| comp 1 | 2.861 | 13.62 | 13.6 |
| comp 2 | 1.382 | 6.58 | 20.2 |
| comp 3 | 1.167 | 5.55 | 25.8 |
| comp 4 | 1.049 | 5.00 | 30.8 |
| comp 5 | 1.035 | 4.93 | 35.7 |
| comp 6 | 1.008 | 4.80 | 40.5 |
| comp 7 | 0.990 | 4.71 | 45.2 |
| comp 8 | 0.972 | 4.63 | 49.8 |
| comp 9 | 0.944 | 4.49 | 54.3 |
| comp 10 | 0.931 | 4.43 | 58.8 |
| comp 11 | 0.903 | 4.30 | 63.1 |
| comp 12 | 0.874 | 4.16 | 67.2 |
| comp 13 | 0.871 | 4.15 | 71.4 |
| comp 14 | 0.839 | 4.00 | 75.4 |
| comp 15 | 0.807 | 3.84 | 79.2 |
Before applying the models to the data, we decided to aggregate the column called id_orders by department, so that we know the number of products purchased from each department per order. In addition, we kept the column order_dow to identify on which day of the week an order was placed.
After creating this new table, we converted the column order_dow from numeric (int) to categorical values (factor), and to make the values easier to interpret, we changed the integers to day-of-week names. For example, the value “0” was transformed to “Sunday”, “1” to “Monday”, “2” to “Tuesday”, and so on.
Moreover, we split the new table into two sets, to ensure that the model does not overfit the data and that the prediction results generalize. We selected 80% of the observations at random as the training set (around 105k observations), and kept the remaining 20% as the test set (around 26k observations).
| order_dow | canned.goods | dairy.eggs | produce | beverages | deli | frozen | pantry | snacks | bakery | household | meat.seafood | personal.care | dry.goods.pasta | babies | missing | other | breakfast | international | alcohol | bulk | pets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Thursday | 1 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 3 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 0 | 6 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 0 | 4 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wednesday | 8 | 11 | 7 | 4 | 3 | 3 | 4 | 1 | 1 | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 2 | 9 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Wednesday | 0 | 4 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 0 | 2 | 0 | 0 | 1 | 3 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wednesday | 0 | 4 | 7 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| Monday | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Saturday | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tuesday | 0 | 4 | 19 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Thursday | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Monday | 1 | 1 | 0 | 5 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tuesday | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Thursday | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Monday | 0 | 2 | 7 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Monday | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Saturday | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wednesday | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Monday | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 0 | 0 | 3 | 0 | 2 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 0 | 9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 1 | 5 | 3 | 0 | 0 | 7 | 0 | 4 | 3 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 |
| Sunday | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Thursday | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wednesday | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Monday | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Wednesday | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Thursday | 0 | 0 | 0 | 5 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Sunday | 0 | 1 | 14 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tuesday | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Monday | 0 | 2 | 7 | 3 | 1 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Friday | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tuesday | 0 | 4 | 6 | 3 | 2 | 3 | 1 | 7 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 |
| Monday | 0 | 4 | 3 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 3 | 1 | 7 | 1 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Monday | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tuesday | 2 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tuesday | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Friday | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
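The preprocessing described above (recoding order_dow to day names and the random 80/20 split) can be sketched as follows. The actual work was done in R on the full data frame, so the list-of-dicts representation here is only illustrative:

```python
import random

# Instacart encodes order_dow as 0 = Sunday, 1 = Monday, ...
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

def recode_dow(rows):
    """Replace the integer order_dow with its day name (factor level)."""
    return [{**r, "order_dow": DAY_NAMES[r["order_dow"]]} for r in rows]

def train_test_split(rows, train_frac=0.8, seed=42):
    """Randomly assign ~80% of rows to training, the rest to test."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy rows: one aggregated order per row, with a hypothetical produce count.
rows = [{"order_dow": d % 7, "produce": d} for d in range(100)]
train, test = train_test_split(recode_dow(rows))
print(len(train), len(test))  # 80 20
```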
Our goal is to determine the day of the week on which a given order will be placed. Since we have transformed the column order_dow into a factor with categorical values, we will apply models suited to a classification task.
We have chosen the models as follows:
In addition, we will apply some of the following approaches to each of the models:
Decision trees are algorithms that recursively search the feature space for the best possible split boundary until a stopping criterion is reached (Ivo Bernardo, 2021). Their basic mechanism is to split the data space into rectangles, evaluating each candidate split; the main goal is to minimize the impurity of each split relative to the previous one.
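The impurity measure being minimized at each split is typically the Gini index. A minimal sketch of how a candidate split would be scored (the day labels here are only illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(left, right):
    """Weighted impurity of a candidate split of a parent node."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A perfectly separating split has impurity 0; a maximally mixed one is worst.
pure = split_impurity(["Sunday"] * 4, ["Monday"] * 4)
mixed = split_impurity(["Sunday", "Monday"] * 2, ["Sunday", "Monday"] * 2)
print(pure, mixed)  # 0.0 0.5
```

The tree grower picks, at each node, the department and threshold whose resulting split minimizes this weighted impurity.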
One day of the week - Unbalanced data
For this approach, we want to measure the accuracy of the model on the unbalanced data. Furthermore, it will be interesting to see which departments were considered best for splitting the data into days of the week, to be compared later with the balanced, cross-validated model (second approach).
According to the pruned tree, the produce department has the most relevance among the departments; this could be influenced by the fact that this department has the highest number of products purchased in our data set. Furthermore, the tree shows that when the number of products purchased from produce is greater than or equal to 3, the model classifies the day of the week as Sunday; when it is lower than 3, the tree splits into another node based on the frozen department.
Likewise, the same procedure applies to this node and the following ones: each starts from the previous node and tries to minimize the impurity at each split. Note that not all days of the week appear at the terminal nodes, because of the way the tree is generated. For the same reason, we expect zero predictions on the test set for every day of the week other than Sunday and Monday.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#> Friday 0 0 0 0 0 0 0
#> Monday 694 706 578 650 708 685 661
#> Saturday 0 0 0 0 0 0 0
#> Sunday 2787 3228 3202 4843 2483 2538 2476
#> Thursday 0 0 0 0 0 0 0
#> Tuesday 0 0 0 0 0 0 0
#> Wednesday 0 0 0 0 0 0 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.211
#> 95% CI : (0.207, 0.216)
#> No Information Rate : 0.209
#> P-Value [Acc > NIR] : 0.2
#>
#> Kappa : 0.016
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: Friday Class: Monday Class: Saturday
#> Sensitivity 0.000 0.1795 0.000
#> Specificity 1.000 0.8217 1.000
#> Pos Pred Value NaN 0.1508 NaN
#> Neg Pred Value 0.867 0.8503 0.856
#> Prevalence 0.133 0.1499 0.144
#> Detection Rate 0.000 0.0269 0.000
#> Detection Prevalence 0.000 0.1784 0.000
#> Balanced Accuracy 0.500 0.5006 0.500
#> Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity 0.882 0.000 0.000
#> Specificity 0.194 1.000 1.000
#> Pos Pred Value 0.225 NaN NaN
#> Neg Pred Value 0.861 0.878 0.877
#> Prevalence 0.209 0.122 0.123
#> Detection Rate 0.185 0.000 0.000
#> Detection Prevalence 0.822 0.000 0.000
#> Balanced Accuracy 0.538 0.500 0.500
#> Class: Wednesday
#> Sensitivity 0.00
#> Specificity 1.00
#> Pos Pred Value NaN
#> Neg Pred Value 0.88
#> Prevalence 0.12
#> Detection Rate 0.00
#> Detection Prevalence 0.00
#> Balanced Accuracy 0.50
As expected, only Sunday and Monday receive predictions, while the remaining days get zero. Overall, the accuracy of this model is low at 0.211, only about 7 percentage points (0.211 - 1/7 ≈ 0.068) above uniform random guessing, and barely above the no-information rate of 0.209. It is also important to note the large gap between sensitivity and specificity, which reflects the class imbalance in our data.
One day of the week - Balanced data with Sub-sampling and Cross-Validation
For this approach, we balance the data with sub-sampling and make the overall score more robust by applying cross-validation, which also helps us find the best set of hyperparameters.
#> .outcome Fri Mon Sat Sun Thu Tue Wed cover
#> Saturday [.14 .13 .16 .14 .14 .14 .15] when produce < 3 & frozen >= 1 18%
#> Sunday [.13 .15 .15 .18 .13 .13 .13] when produce >= 3 44%
#> Thursday [.15 .14 .13 .11 .16 .15 .16] when produce < 3 & frozen < 1 38%
The left column (.outcome) of each rule shows the day selected for the terminal node (the one with the highest probability), followed by the probability of each day of the week under that rule. For the last rule, Wednesday and Thursday appear to have the same probability because of rounding, but Thursday is 0.003 above Wednesday, as can be seen from the tree plot.
The rightmost column (cover) gives the percentage of observations falling under each rule. The first rule says that Saturday is chosen when the produce count is lower than 3 and the frozen count is greater than or equal to 1; this rule covers 18% of the observations. We can then look at the model's results in the confusion matrix.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#> Friday 0 0 0 0 0 0 0
#> Monday 0 0 0 0 0 0 0
#> Saturday 678 645 751 904 595 587 583
#> Sunday 1454 1797 1768 2992 1235 1291 1259
#> Thursday 1349 1492 1261 1597 1361 1345 1295
#> Tuesday 0 0 0 0 0 0 0
#> Wednesday 0 0 0 0 0 0 0
#>
#> Overall Statistics
#>
#> Accuracy : 0.195
#> 95% CI : (0.19, 0.199)
#> No Information Rate : 0.209
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.035
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: Friday Class: Monday Class: Saturday
#> Sensitivity 0.000 0.00 0.1987
#> Specificity 1.000 1.00 0.8223
#> Pos Pred Value NaN NaN 0.1583
#> Neg Pred Value 0.867 0.85 0.8591
#> Prevalence 0.133 0.15 0.1441
#> Detection Rate 0.000 0.00 0.0286
#> Detection Prevalence 0.000 0.00 0.1808
#> Balanced Accuracy 0.500 0.50 0.5105
#> Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity 0.545 0.4265 0.000
#> Specificity 0.576 0.6382 1.000
#> Pos Pred Value 0.254 0.1403 NaN
#> Neg Pred Value 0.827 0.8894 0.877
#> Prevalence 0.209 0.1216 0.123
#> Detection Rate 0.114 0.0519 0.000
#> Detection Prevalence 0.450 0.3697 0.000
#> Balanced Accuracy 0.560 0.5324 0.500
#> Class: Wednesday
#> Sensitivity 0.00
#> Specificity 1.00
#> Pos Pred Value NaN
#> Neg Pred Value 0.88
#> Prevalence 0.12
#> Detection Rate 0.00
#> Detection Prevalence 0.00
#> Balanced Accuracy 0.50
From the confusion matrix we observe a better balance between sensitivity and specificity across the classes: compared with the previous model, the sensitivity of the class Sunday changed from 0.882 to 0.545, and its specificity from 0.194 to 0.576. As expected, the accuracy decreased from 0.211 to 0.195, i.e. only about 5 percentage points above random guessing (0.195 - 1/7 ≈ 0.05), but the balanced accuracy is better. This model would be preferred over the one trained on unbalanced data.
Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation
For the final approach we collapsed the levels of the column order_dow into two: one for the days during the week and one for the weekend. On top of that, we balanced the levels "weekday" and "weekend" and used cross-validation to train the model.
We tried to plot the final tree computed by the model, but it was not possible to interpret it due to the overlapping nodes in the graph; we could nevertheless see that the departments produce, frozen and meat.seafood were among the first splits.
The results of the confusion matrix show well-balanced sensitivity and specificity. The accuracy of the model is similar to the other two approaches, only about 3 percentage points above random guessing (0.533 - 1/2 ≈ 0.03). Overall, all the approaches score low at predicting the day of the week based on the department purchases of previous orders.
A Random Forest (RF) is an ensemble of decision trees whose final prediction aggregates the outcomes of the set of trees considered (the user can define the number of trees and the number of variables tried at each node). One of the reasons we decided to test this method is that RFs are considered more stable than decision trees; more trees generally means better performance, but this advantage comes at a price: RFs slow down computation and cannot be visualized. We will nevertheless look at the results for later comparison (Saikumar Talari, 2022).
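A minimal randomForest sketch (synthetic data, illustrative column names, not the report's actual call) showing the two knobs mentioned above, the number of trees and the variables tried per split:

```r
library(randomForest)

set.seed(42)
n <- 1000
# Synthetic stand-in: weekday/weekend outcome and department counts
df <- data.frame(
  period  = factor(sample(c("weekday", "weekend"), n, replace = TRUE)),
  produce = rpois(n, 3),
  frozen  = rpois(n, 1),
  snacks  = rpois(n, 2)
)

rf <- randomForest(period ~ ., data = df,
                   ntree = 500,   # number of trees averaged over
                   mtry  = 2)     # variables tried at each split
print(rf)                        # OOB error estimate and confusion matrix
```

The out-of-bag (OOB) error printed here gives a built-in validation estimate without a separate hold-out set.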
Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation
For this method we consider the same approach as the last Classification Tree one. We faced computation speed problems while running the model, so we decided to consider only 10,000 orders to reduce the waiting time.
As expected, the accuracy of the model is higher than that of the classification tree, as are Cohen's kappa and the balanced accuracy. This model would be preferred for predicting "weekday" and "weekend", as it gives better results.
One day of the week - Unbalanced data
Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. problems with more than two possible discrete outcomes (Wikipedia, 2021). Like binary logistic regression, it uses maximum likelihood estimation to evaluate the probability of categorical membership.
Our first approach is to predict the day of the week on which the order will be placed from the product composition of the order. Since there are 7 days in a week, this is not a binary but a multinomial logistic regression problem.
We select Sunday as the reference level. To build the model, we use the number of products in each department of the order as explanatory variables.
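The model just described could be sketched with nnet::multinom (synthetic data; the real model uses the counts of all 21 departments as predictors):

```r
library(nnet)

set.seed(1)
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
orders <- data.frame(
  order_dow = factor(sample(days, 800, replace = TRUE)),
  produce   = rpois(800, 3),
  frozen    = rpois(800, 1)
)

# Sunday as the reference level, as stated in the text
orders$order_dow <- relevel(orders$order_dow, ref = "Sunday")

mfit <- multinom(order_dow ~ ., data = orders, trace = FALSE)
head(predict(mfit, type = "probs"))   # per-class probabilities, rows sum to 1
```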
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#> Friday 129 99 91 79 93 94 102
#> Monday 69 87 39 54 95 89 85
#> Saturday 39 46 50 51 36 29 24
#> Sunday 3208 3673 3584 5278 2931 2981 2900
#> Thursday 20 19 9 21 24 18 18
#> Tuesday 0 0 0 0 0 0 0
#> Wednesday 16 10 7 10 12 12 8
#>
#> Overall Statistics
#>
#> Accuracy : 0.213
#> 95% CI : (0.208, 0.218)
#> No Information Rate : 0.209
#> P-Value [Acc > NIR] : 0.105
#>
#> Kappa : 0.01
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Statistics by Class:
#>
#> Class: Friday Class: Monday Class: Saturday
#> Sensitivity 0.03706 0.02211 0.01323
#> Specificity 0.97548 0.98068 0.98998
#> Pos Pred Value 0.18777 0.16795 0.18182
#> Neg Pred Value 0.86882 0.85043 0.85634
#> Prevalence 0.13267 0.14993 0.14406
#> Detection Rate 0.00492 0.00332 0.00191
#> Detection Prevalence 0.02618 0.01974 0.01048
#> Balanced Accuracy 0.50627 0.50140 0.50160
#> Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity 0.9609 0.007521 0.000
#> Specificity 0.0708 0.995444 1.000
#> Pos Pred Value 0.2149 0.186047 NaN
#> Neg Pred Value 0.8723 0.878705 0.877
#> Prevalence 0.2093 0.121613 0.123
#> Detection Rate 0.2012 0.000915 0.000
#> Detection Prevalence 0.9358 0.004916 0.000
#> Balanced Accuracy 0.5158 0.501483 0.500
#> Class: Wednesday
#> Sensitivity 0.002550
#> Specificity 0.997100
#> Pos Pred Value 0.106667
#> Neg Pred Value 0.880408
#> Prevalence 0.119555
#> Detection Rate 0.000305
#> Detection Prevalence 0.002858
#> Balanced Accuracy 0.499825
According to the confusion matrix, the accuracy (0.213) is low and there is a big difference between sensitivity and specificity in each class: for example, the sensitivity of the class Friday is 0.037 while its specificity is 0.975. The kappa (0.01) is also very small, which means the observed accuracy is only slightly higher than what one would expect from a random model. We therefore try to balance the data and use cross-validation to improve the model's accuracy.
One day of the week - balanced data with cross-validation
Before balancing the data, we need to check the frequency of each class. The class Wednesday has the smallest frequency (12,550), so we balance the data by sub-sampling every class down to that frequency.
#>
#> Sunday Friday Monday Saturday Thursday Tuesday
#> 21972 13925 15738 15121 12768 12896
#> Wednesday
#> 12550
#>
#> Sunday Friday Monday Saturday Thursday Tuesday
#> 12550 12550 12550 12550 12550 12550
#> Wednesday
#> 12550
We only sub-sample the data, without applying cross-validation; now every class has the same frequency (12,550).
We tried cross-validation on the sub-sampled data using the train function of the caret package, but the data set is too big and takes a very long time to run, so we decided not to include cross-validation.
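The sub-sampling step, i.e. drawing every class down to the minority class frequency, can be done with caret's downSample; a sketch on synthetic, deliberately unbalanced data:

```r
library(caret)

set.seed(7)
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
# Deliberately unbalanced synthetic outcome (Sunday over-represented)
y <- factor(sample(days, 2000, replace = TRUE, prob = c(3, 2, 1, 1, 1, 1, 2)))
x <- data.frame(produce = rpois(2000, 3))

balanced <- downSample(x = x, y = y, yname = "order_dow")
table(balanced$order_dow)   # every class now has the minority class frequency
```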
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#> Friday 245 258 222 297 203 203 200
#> Monday 348 448 383 629 310 347 329
#> Saturday 403 436 459 684 348 345 336
#> Sunday 797 1033 1064 1868 664 735 698
#> Thursday 737 760 715 869 718 709 689
#> Tuesday 356 371 360 453 320 282 290
#> Wednesday 595 628 577 693 628 602 595
#>
#> Overall Statistics
#>
#> Accuracy : 0.176
#> 95% CI : (0.171, 0.181)
#> No Information Rate : 0.209
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.03
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Statistics by Class:
#>
#> Class: Friday Class: Monday Class: Saturday
#> Sensitivity 0.07038 0.1139 0.1214
#> Specificity 0.93923 0.8948 0.8864
#> Pos Pred Value 0.15049 0.1603 0.1524
#> Neg Pred Value 0.86851 0.8513 0.8570
#> Prevalence 0.13267 0.1499 0.1441
#> Detection Rate 0.00934 0.0171 0.0175
#> Detection Prevalence 0.06205 0.1065 0.1148
#> Balanced Accuracy 0.50481 0.5044 0.5039
#> Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity 0.3401 0.2250 0.0875
#> Specificity 0.7594 0.8057 0.9066
#> Pos Pred Value 0.2723 0.1382 0.1160
#> Neg Pred Value 0.8130 0.8825 0.8765
#> Prevalence 0.2093 0.1216 0.1228
#> Detection Rate 0.0712 0.0274 0.0107
#> Detection Prevalence 0.2614 0.1981 0.0927
#> Balanced Accuracy 0.5497 0.5153 0.4970
#> Class: Wednesday
#> Sensitivity 0.1897
#> Specificity 0.8388
#> Pos Pred Value 0.1378
#> Neg Pred Value 0.8840
#> Prevalence 0.1196
#> Detection Rate 0.0227
#> Detection Prevalence 0.1646
#> Balanced Accuracy 0.5143
From the confusion matrix report we notice an improvement in the difference between sensitivity and specificity for each class. For example, the sensitivity and specificity of the class Thursday were 0.008 and 0.995 in the previous model; after balancing the data, they are 0.225 and 0.806. The kappa is also higher (from 0.01 to 0.03).
Weekdays and Weekend - Balanced data and Cross-Validation
Logistic regression is a regression adapted to binary classification. The basic idea is to reuse the mechanism developed for linear regression by modeling the probability p_i with a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear combination is then transformed into a probability by a sigmoid function.
In order to further improve model quality, we aggregate the levels of the day of the week. Buying behavior usually differs between weekdays and weekends, so we split the day of the week into the two classes weekday and weekend.
Now the outcome variable has only two categories so we can use the binomial logistic regression.
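A minimal sketch of the binomial logistic regression (synthetic data; `weekend` is an illustrative 0/1 outcome, not the report's actual variable name):

```r
set.seed(99)
n <- 1000
df <- data.frame(
  weekend = rbinom(n, 1, 0.3),          # 1 = weekend, 0 = weekday
  produce = rpois(n, 3),
  frozen  = rpois(n, 1)
)

fit <- glm(weekend ~ produce + frozen, data = df, family = binomial)

p <- predict(fit, type = "response")    # sigmoid of the linear predictor
pred <- ifelse(p > 0.5, "weekend", "weekday")
table(pred, ifelse(df$weekend == 1, "weekend", "weekday"))
```

`type = "response"` applies the sigmoid to the linear combination, so `p` is a probability in (0, 1).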
According to the confusion matrix, the balanced accuracy is higher and the difference between sensitivity (0.62) and specificity (0.48) is even smaller. The kappa is now 0.10, higher than the previous model's Cohen's kappa (0.03), and the accuracy is 0.57.
Comparing this model against the previous ones, the random forest is the one whose results are closest; logistic regression is only slightly higher in accuracy (by 0.008), in Cohen's kappa (by 0.010) and in balanced accuracy (by 0.005). For those reasons we have decided to choose this model over the rest.
Variable importance is a method that measures the contribution of each feature to the model's prediction quality. We analyze the variable importance of our 4 models. There are 21 explanatory variables in each model, and we only show the top 10 most important ones in the plots.
Note: we faced computation speed problems while running the variable importance for all the models, so we decided to use the same number of observations as for the random forest model (10k) to reduce the waiting time.
As we can see from this chart, the Classification Tree, Random Forest and Logistic Regression models use the AUC loss to compare model quality after shuffling each variable. AUC is a synthetic measure of the distance to the random model in the ROC curve plot: the larger the AUC, the better the model.
According to the feature importance of these three models, the most important department is produce. As seen above in the table of number of purchases per department, produce is the leading department, with almost twice the count of the second one. This could be one of the main reasons why all three models chose it as the most relevant: if we shuffle this variable, the AUC of the model suffers the largest loss.
For the Multinomial Logistic Regression model, we use the Root Mean Square Error (RMSE) to compare model quality after shuffling each variable. According to the plot, the most important variable is dairy.eggs: if we shuffle this variable, the RMSE of the model shows the largest increase.
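Permutation importance with an AUC-based loss can be computed, for example, with the DALEX package; the report's exact tooling is not shown here, so the following is only a sketch on synthetic data under that assumption:

```r
library(DALEX)

set.seed(5)
n <- 500
df <- data.frame(
  weekend = rbinom(n, 1, 0.4),   # illustrative binary outcome
  produce = rpois(n, 3),
  frozen  = rpois(n, 1),
  snacks  = rpois(n, 2)
)
fit <- glm(weekend ~ ., data = df, family = binomial)

# Wrap the model, then shuffle each variable and measure the 1 - AUC loss
expl <- explain(fit, data = df[, -1], y = df$weekend, verbose = FALSE)
vi <- model_parts(expl, loss_function = loss_one_minus_auc)
head(vi)   # mean loss after permuting each variable; plot(vi) for the chart
```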
In this section, we will work with unsupervised learning methods i.e. Clustering and PCA to learn how we can reduce the dimensionality of the original data set and how to group data by similarity/or dissimilarity of all the features. Then, we will also study further by using a hybrid supervised/unsupervised learning method to perform the prediction (supervised) by using the result from the PCA analysis (unsupervised).
In this section, we will study clustering approaches, Hierarchical clustering and Partitioning methods, to find groups of instances/observations that have similar features.
Due to the limitations of the clustering functions in R, the execution time when we tried to cluster all 131,206 instances was very long. Since in this exercise we would like to focus on the approaches/methodologies, we randomly choose only 1% of the instances (1,312) to perform the analysis and reduce the execution time.
Distance
First, we apply Agglomerative Nesting (AGNES) and compute the distances using the Euclidean distance, because all our features are numerical.
| Var1 | Var2 | value |
|---|---|---|
| 1 | 1 | 0.00 |
| 2 | 1 | 9.95 |
| 3 | 1 | 6.00 |
| 4 | 1 | 9.16 |
| 5 | 1 | 7.14 |
| 6 | 1 | 10.95 |
| 7 | 1 | 7.62 |
| 8 | 1 | 8.72 |
| 9 | 1 | 9.90 |
| 10 | 1 | 6.00 |
| 11 | 1 | 4.47 |
| 12 | 1 | 11.53 |
| 13 | 1 | 7.75 |
| 14 | 1 | 7.00 |
| 15 | 1 | 11.49 |
| 16 | 1 | 11.66 |
| 17 | 1 | 2.45 |
| 18 | 1 | 9.27 |
| 19 | 1 | 11.87 |
| 20 | 1 | 7.07 |
| 21 | 1 | 9.80 |
| 22 | 1 | 11.96 |
| 23 | 1 | 9.70 |
| 24 | 1 | 10.58 |
| 25 | 1 | 9.33 |
| 26 | 1 | 6.00 |
| 27 | 1 | 8.00 |
| 28 | 1 | 9.22 |
| 29 | 1 | 9.75 |
| 30 | 1 | 10.77 |
| 31 | 1 | 11.40 |
| 32 | 1 | 5.75 |
| 33 | 1 | 10.34 |
| 34 | 1 | 5.83 |
| 35 | 1 | 9.54 |
| 36 | 1 | 10.91 |
| 37 | 1 | 9.59 |
| 38 | 1 | 9.95 |
| 39 | 1 | 8.94 |
| 40 | 1 | 6.33 |
| 41 | 1 | 11.09 |
| 42 | 1 | 8.89 |
| 43 | 1 | 11.87 |
| 44 | 1 | 9.11 |
| 45 | 1 | 12.49 |
| 46 | 1 | 11.79 |
| 47 | 1 | 11.00 |
| 48 | 1 | 9.64 |
| 49 | 1 | 15.46 |
| 50 | 1 | 7.68 |
> Dendrogram
We then plot the dendrogram using complete linkage to visualize the output of the hierarchical clustering. Since the original dendrogram is difficult to read, we will select the optimal number of clusters and cut the tree branches in the next steps.
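The distance computation and dendrogram described above can be sketched as follows (synthetic stand-in for the 1% sample; column names are illustrative):

```r
library(cluster)

set.seed(11)
# Synthetic department counts for 100 instances
mat <- matrix(rpois(100 * 5, 3), nrow = 100,
              dimnames = list(NULL, c("produce", "frozen", "dairy.eggs",
                                      "snacks", "deli")))

d  <- dist(mat, method = "euclidean")   # Euclidean distances (numeric features)
ag <- agnes(d, method = "complete")     # AGNES with complete linkage

plot(as.dendrogram(as.hclust(ag)))      # dendrogram of the hierarchy
cutree(as.hclust(ag), k = 3)[1:10]      # cut the tree into 3 clusters
```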
Choice of the number of clusters
We will choose the optimal number of clusters from statistics: the within-cluster sum of squares, the Gap statistic and the silhouette, using complete linkage on the Euclidean distance.
From the graph we can interpret the following:
For those reasons we have decided to choose 3 as a number of clusters and cut the trees as follows.
Interpretation of the clusters
We will analyze the clusters by using the box plot for each feature.
Our observations are as follows:
In this section, we will apply partitioning methods, K-means and Partitioning Around the Medoid (PAM). For the partitioning methods, we first need to identify the number of clusters and then use the chosen number of clusters to perform the analysis.
K-means
We will use WSS, silhouette and the Gap statistic to determine the number of clusters used for K-means. It’s important to note that K-means is suitable for numerical features only. Since all our features are numerical, it’s appropriate to perform the K-means analysis.
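A sketch of these statistics and the K-means fit, using factoextra on synthetic data (column names illustrative):

```r
library(factoextra)

set.seed(21)
mat <- matrix(rpois(200 * 4, 3), nrow = 200,
              dimnames = list(NULL, c("produce", "frozen", "snacks", "deli")))

fviz_nbclust(mat, kmeans, method = "wss")         # elbow on within-cluster SS
fviz_nbclust(mat, kmeans, method = "silhouette")  # average silhouette per k
fviz_nbclust(mat, kmeans, method = "gap_stat")    # Gap statistic

km <- kmeans(mat, centers = 2, nstart = 25)       # fit with the chosen k
table(km$cluster)
```

`nstart = 25` reruns K-means from 25 random initializations and keeps the best solution, guarding against poor local minima.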
Therefore, 2 is the optimal number of clusters. Afterward, we plot the box plots to distinguish the characteristics of the 2 clusters. We observe that cluster 2 has a higher average number of purchases than cluster 1 in every department whose median number of purchases is above zero, such as canned.goods, dairy.eggs, produce, beverages, deli, frozen, pantry, snacks, bakery, meat.seafood and dry.goods.pasta.
Next, we show a scatter plot along the first and second principal components, colored by the 2 K-means clusters. PC1 distinguishes the clusters quite well: cluster 1 has higher PC1 values than cluster 2.
Regarding the principal components, the percentages of variance of PC1 and PC2 are 47% and 14% respectively, close to what we observed in the PCA analysis of the EDA section. Note, however, that the numbers are not identical, because we chose only 1% of the instances from the original data set for this clustering exercise due to computing capacity limitations.
Partitioning Around the Medoid (PAM)
Similar to K-means, we need to find an optimal number of clusters before performing the analysis. We use the silhouette to determine the optimal k.
From the graph below, we find that the optimal number of clusters is 2.
Then, we will plot silhouette to show the silhouettes of all the instances and the average silhouette.
From the graph, we see that cluster 2 is well formed (well separated from Cluster 1, with the average silhouette of 0.46). Cluster 1 is less homogeneous with an average silhouette of 0.16 only. The average silhouette of the data set is 0.4.
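A minimal PAM and silhouette sketch on synthetic data (not the report's actual call):

```r
library(cluster)

set.seed(31)
mat <- matrix(rpois(150 * 4, 3), nrow = 150)

pm  <- pam(mat, k = 2, metric = "euclidean")  # partition around 2 medoids
sil <- silhouette(pm)

summary(sil)$avg.width   # overall average silhouette width
plot(sil, border = NA)   # silhouette widths per instance and per cluster
```

Unlike K-means centroids, PAM medoids are actual observations, which makes the cluster representatives directly interpretable.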
Afterward, we plot the box plots to distinguish the characteristics of the 2 clusters. We observe that cluster 1 has a higher average number of purchases than cluster 2. Interestingly, the characteristics of the 2 PAM clusters are very similar to those of the K-means clusters.
In our data set, there are 134 aisles grouped into 21 departments. So far, our supervised learning approach has focused on the number of purchases per department. In this section, we combine supervised and unsupervised learning as follows.
Grouping aisles using PCA: our assumption is that grouping aisles by PCA may reflect customers' purchase patterns better than grouping by department.
Performing the supervised learning approach on the result of the first step.
Non-scaled PCA (Covariance)
We observe that the first and second components explain 23.98% and 9.02% of the variance of the data. Following the rule of thumb of selecting the number of dimensions that explains at least 75% of the variation, components 1-28 are selected, explaining around 75.5% of the variance.
| | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 1 | 4.211 | 23.985 | 24.0 |
| comp 2 | 1.585 | 9.027 | 33.0 |
| comp 3 | 0.889 | 5.061 | 38.1 |
| comp 4 | 0.668 | 3.806 | 41.9 |
| comp 5 | 0.531 | 3.022 | 44.9 |
| comp 6 | 0.487 | 2.772 | 47.7 |
| comp 7 | 0.442 | 2.516 | 50.2 |
| comp 8 | 0.370 | 2.106 | 52.3 |
| comp 9 | 0.350 | 1.993 | 54.3 |
| comp 10 | 0.303 | 1.724 | 56.0 |
| comp 11 | 0.293 | 1.671 | 57.7 |
| comp 12 | 0.276 | 1.573 | 59.3 |
| comp 13 | 0.266 | 1.513 | 60.8 |
| comp 14 | 0.241 | 1.372 | 62.1 |
| comp 15 | 0.217 | 1.233 | 63.4 |
| comp 16 | 0.206 | 1.172 | 64.5 |
| comp 17 | 0.194 | 1.102 | 65.6 |
| comp 18 | 0.183 | 1.040 | 66.7 |
| comp 19 | 0.181 | 1.029 | 67.7 |
| comp 20 | 0.175 | 0.998 | 68.7 |
| comp 21 | 0.170 | 0.967 | 69.7 |
| comp 22 | 0.163 | 0.931 | 70.6 |
| comp 23 | 0.152 | 0.867 | 71.5 |
| comp 24 | 0.149 | 0.849 | 72.3 |
| comp 25 | 0.144 | 0.818 | 73.1 |
| comp 26 | 0.141 | 0.806 | 74.0 |
| comp 27 | 0.139 | 0.790 | 74.7 |
| comp 28 | 0.130 | 0.742 | 75.5 |
> Scaled PCA (Correlation)
We find that the first and second components can explain only 3.2% and 1.8% respectively, and we need 93 components (out of 134) to explain 75% of the variation. This means that correlations between aisles are very low and we cannot use PCA to reduce the dimensions of the scaled data.
| | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 1 | 4.27 | 3.186 | 3.19 |
| comp 2 | 2.43 | 1.814 | 5.00 |
| comp 3 | 1.93 | 1.439 | 6.44 |
| comp 4 | 1.68 | 1.257 | 7.70 |
| comp 5 | 1.56 | 1.168 | 8.86 |
| comp 6 | 1.46 | 1.088 | 9.95 |
| comp 7 | 1.42 | 1.059 | 11.01 |
| comp 8 | 1.33 | 0.990 | 12.00 |
| comp 9 | 1.27 | 0.948 | 12.95 |
| comp 10 | 1.25 | 0.937 | 13.88 |
All in all, scaled PCA cannot reduce the dimensions of the data set. Although the disadvantage of non-scaled PCA is that it tends to emphasize variables with high variance and high correlation to the others while neglecting variables with low variance and low correlation, it lets us derive the main benefit of PCA, which is dimensionality reduction. Moreover, since the objective of our research is to support demand forecasting (e.g. inventory arrangement), the high-variation departments and aisles tend to be the more important ones for this purpose. Thus, we focus on non-scaled PCA.
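The two PCA variants and the 75% rule of thumb can be sketched with prcomp (synthetic data; the real analysis uses the 134 aisle columns):

```r
set.seed(41)
mat <- matrix(rpois(300 * 10, 3), nrow = 300)

pca_cov <- prcomp(mat, scale. = FALSE)  # non-scaled PCA (covariance matrix)
pca_cor <- prcomp(mat, scale. = TRUE)   # scaled PCA (correlation matrix)

# Cumulative % of variance explained, used for the 75% rule of thumb
cumvar <- function(p) cumsum(p$sdev^2) / sum(p$sdev^2) * 100

head(cumvar(pca_cov))
which(cumvar(pca_cov) >= 75)[1]  # number of components needed for >= 75%
```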
In this section, we apply a supervised learning approach to the output of the non-scaled PCA analysis. From the PCA analysis, we select the first 28 principal components, which explain over 75% of the total variation. As the supervised learner, we use logistic regression, the best model from the supervised learning section.
According to the result from the original data set (without PCA) in the supervised learning section, sensitivity, specificity, kappa and accuracy are 0.628, 0.479, 0.103 and 0.57 respectively.
From the confusion matrix of the hybrid approach, we find that the sensitivity (0.628) and the accuracy (0.57) are equivalent to the result from logistic regression. However, the specificity (0.463) and kappa (0.088) are slightly lower than the result of logistic regression. Thus, we conclude that this method doesn’t improve the quality of the model.
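The hybrid pipeline, non-scaled PCA followed by logistic regression on the leading components, can be sketched as follows (synthetic data; 5 components are kept here instead of 28 for brevity):

```r
set.seed(51)
n <- 500
X <- matrix(rpois(n * 20, 3), nrow = n)  # aisle-level counts (synthetic)
y <- rbinom(n, 1, 0.4)                   # illustrative 0/1 weekend outcome

pca    <- prcomp(X, scale. = FALSE)      # non-scaled PCA, as in the report
k      <- 5                              # keep the first k components
scores <- as.data.frame(pca$x[, 1:k])

# Logistic regression on the principal component scores
fit <- glm(y ~ ., data = cbind(y = y, scores), family = binomial)
mean((predict(fit, type = "response") > 0.5) == y)  # in-sample accuracy
```

In practice the PCA rotation would be fitted on the training set only, with new observations projected via predict(pca, newdata) before scoring, to avoid leakage.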
For the supervised learning model, we use sensitivity, specificity, balanced accuracy, overall accuracy and kappa to compare the quality of different models.
For the unsupervised learning model, we perform clustering and PCA analysis to group a set of instances/features that share some common characteristics. In addition, we also use the result from the PCA analysis to perform a hybrid supervised/unsupervised learning model to see if we can use PCA to improve the performance of the supervised learning model.
One day of the week - Unbalanced data
According to tables 1 and 2, the overall accuracy and kappa of the decision tree and the multinomial regression are very low. What's more, there is a serious imbalance between sensitivity and specificity. The quality of both models is very similar, and both need to be improved.
Accuracy: decision tree: 0.211, multinomial logistic regression: 0.213
Kappa: decision tree: 0.016, multinomial logistic regression: 0.01
Table 1 - Decision Tree
Table 2 - Multinomial Logistic Regression
One day of the week - balanced data
According to tables 3 and 4, after balancing the data and applying cross-validation, the difference between sensitivity and specificity is smaller. Even though the accuracy is slightly lower, the kappa of both the decision tree and the multinomial logistic regression is higher.
Accuracy: decision tree: 0.195, multinomial logistic regression: 0.176
Kappa: decision tree: 0.035, multinomial logistic regression: 0.03
Table 3 - with cross-validation: Decision Tree
Table 4 - Multinomial Logistic Regression
Weekdays and Weekend-Balanced data with Cross-Validation
According to table 5, the balanced accuracy and kappa are higher after we aggregate the days of the week into weekday and weekend. We finally choose the logistic regression model, as it has the highest balanced accuracy and kappa.
Table 5 - The scores of three models
| | Sensitivity | Specificity | Accuracy | Balanced accuracy | Kappa |
|---|---|---|---|---|---|
| Decision Tree | 0.488 | 0.614 | 0.533 | 0.551 | 0.091 |
| Random Forest | 0.609 | 0.489 | 0.567 | 0.549 | 0.093 |
| Logistic Regression | 0.628 | 0.479 | 0.575 | 0.554 | 0.103 |
Clustering
Clustering is a method that divides data, categorized or not, into groups of similar observations. We used 2 clustering methods, hierarchical clustering and partitioning, and chose the number of clusters from statistics. We found that the number of clusters depends heavily on the statistic chosen: for example, for the partitioning method, the optimal number of clusters from the Gap statistic is 13, while the optimal number from the silhouette is only 2.
For hierarchical clustering, we chose 3 as the optimal number of clusters. The main characteristics of the clusters are, for example, high "produce" for cluster 1 and high "snacks" and "beverages" for cluster 3. The average numbers of purchases in clusters 1 and 3 are significantly higher than in cluster 2 in every department.
For the partitioning methods, we chose 2 as the optimal number of clusters for both the K-means and PAM models. For K-means, we found that cluster 2 has a higher average number of purchases than cluster 1 in every department; this means that cluster 2 in K-means has similar characteristics to clusters 1 and 3 from the hierarchical clustering. For PAM, the results are very close to those of K-means.
PCA
To derive the benefits of PCA and to test our hypothesis that grouping aisles reflects customer purchasing behavior better than grouping departments, we perform PCA by aisle. We focus on non-scaled PCA to keep the setting in line with the supervised learning analysis. As a result, components 1 to 28 (out of 134) explain 75.5% of the total variation of the data; fresh vegetables, fresh fruits and packaged vegetables fruits have the highest variance and are the top contributors to components 1 and 2. We also performed scaled PCA, which needs 93 of the 134 components to explain at least 75% of the total variation; thus we cannot derive the benefits of PCA there, and it also shows that the aisles are barely correlated. Finally, we tried a hybrid supervised/unsupervised approach: components 1-28 were fed into the best model from the supervised learning section, the logistic regression model. However, we find that the hybrid method does not improve the accuracy of the prediction.
The data set provided by Instacart contains only a few variables: products purchased per order, product descriptions, and the day of the week and hour of each order. We found that this information is not sufficient to create predictive models accurate enough for commercial use. However, Amazon is an ideal case study showing that, with a high-quality and comprehensive data set, it is even possible to use machine learning to anticipate and ship products before customers actually order them.
We also ran into limitations when running machine learning functions in R on very large data. In our study, we focus only on the training data set, which contains over 100k observations, and do not include the prior data set, which represents all historical orders before the training set and contains over a million observations, because the execution time is extremely long and errors occur regularly when we run the models on all historical data (prior + training).
Machine learning and artificial intelligence have played a key role in the boom of the e-commerce industry, and there are many applications of machine learning techniques in it: purchase and repurchase prediction for anticipating what customers will order next, and recommendation systems for suggesting products users might want to buy, a system central to the success of Amazon and even Netflix. There are also other applications such as fraud prediction and marketing campaigns. However, to build these models with sufficient accuracy for commercial use, we would need more data, such as customers' personal information, browsing history and click history, to get a better understanding of customer behavior.
Ivo Bernardo, "Classification Decision Trees, Easily Explained", Aug 30, 2021
Saikumar Talari, "Random Forest vs Decision Tree: Key Differences", Feb 18, 2022
Wikipedia, "Multinomial logistic regression", Aug 23, 2021
S. Walusala W., R. Rimiru, C. Otieno, "A hybrid machine learning approach for credit scoring using PCA and logistic regression", International Journal of Computer, ISSN 2307-4523
Jeremy Stanley, "3 Million Instacart Orders, Open Sourced", May 3, 2017
Nick Shaw, "Ecommerce Machine Learning: AI's Role in the Future of Online Shopping", Bigcommerce Blog
The Economic Times, "Amazon may predict and ship your order before you place it", Jan 27, 2014